library(dplyr)
library(INLA)
library(ggplot2)
library(patchwork)
library(inlabru)
library(mapview)
# load some libraries to generate nice map plots
library(scico)
Practical 5
Aim of this practical: In this practical we are going to look at some simple models:
- Areal data
- A linear mixed model
- A GLM with random effects
We are going to learn:
- How to fit some commonly used models with inlabru
- How to explore the results
- How to get predictions for missing data points
Models for areal data
In this practical we are going to fit an areal model. We will:
Learn how to fit an areal model in inlabru
Learn how to add spatial covariates to the model
Learn how to do predictions
Learn how to simulate from the fitted model
Libraries to load:
The data
We consider data on respiratory hospitalizations for Greater Glasgow and Clyde in 2007. The data are available from the CARBayesdata R Package:
library(CARBayesdata)
data(pollutionhealthdata)
data(GGHB.IZ)
The pollutionhealthdata contains the spatiotemporal data on respiratory hospitalizations, air pollution concentrations and socio-economic deprivation covariates for the 271 Intermediate Zones (IZ) that make up the Greater Glasgow and Clyde health board in Scotland. Data are provided by the Scottish Government and the available variables are:
- IZ: unique identifier for each IZ.
- year: the year when the measurements were taken.
- observed: observed numbers of hospitalizations due to respiratory disease.
- expected: expected numbers of hospitalizations due to respiratory disease, computed using indirect standardisation from Scotland-wide respiratory hospitalization rates.
- pm10: average particulate matter (less than 10 microns) concentrations.
- jsa: the percentage of working-age people who are in receipt of Job Seekers Allowance.
- price: average property price (divided by 100,000).
The GGHB.IZ data is a Simple Features (sf) object containing the spatial polygon information for the set of 271 Intermediate Zones (IZ) that make up the Greater Glasgow and Clyde health board in Scotland ( Figure 1 ).
We first merge the two datasets, select only one year of data, compute the SMR and plot the observed values:
resp_cases <- merge(GGHB.IZ %>%
mutate(space = 1:dim(GGHB.IZ)[1]),
pollutionhealthdata, by = "IZ") %>%
filter(year == 2007) %>%
mutate(SMR = observed/expected)
ggplot() + geom_sf(data = resp_cases, aes(fill = SMR)) + scale_fill_scico(direction = -1)
Then we compute the adjacency matrix using the functions poly2nb() and nb2mat() in the spdep library. We then convert the adjacency matrix into the precision matrix \(\mathbf{Q}\) of the CAR model. Remember this matrix has, on the diagonal, the number of neighbours of each area and, in position \((i,j)\), the value \(-1\) if areas \(i\) and \(j\) are neighbours and \(0\) otherwise.
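This construction can be sanity-checked on a toy adjacency matrix before applying it to the real map (the 3-area chain below is purely illustrative):

```r
# toy 3-area chain: area 2 neighbours areas 1 and 3
R_toy <- matrix(c(0, 1, 0,
                  1, 0, 1,
                  0, 1, 0), nrow = 3)
Q_toy <- -R_toy                 # off-diagonal entries are -1 for pairs of neighbours
diag(Q_toy) <- rowSums(R_toy)   # diagonal holds the number of neighbours of each area
Q_toy
#      [,1] [,2] [,3]
# [1,]    1   -1    0
# [2,]   -1    2   -1
# [3,]    0   -1    1
```

Note that every row of \(\mathbf{Q}\) sums to zero, the usual property of an intrinsic CAR precision matrix.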
library(spdep)
W.nb <- poly2nb(GGHB.IZ,queen = TRUE)
R <- nb2mat(W.nb, style = "B", zero.policy = TRUE)
num_neigh = apply(R, 1, sum)
Q = -R
diag(Q) = num_neigh
The model
We fit a first model to the data where we consider a Poisson model for the observed cases.
Stage 1 Model for the response \[ y_i|\eta_i\sim\text{Poisson}(E_i\lambda_i) \] where \(E_i\) are the expected cases for area \(i\).
Stage 2 Latent field model \[ \eta_i = \text{log}(\lambda_i) = \beta_0 + \omega_i + z_i \] where
- \(\beta_0\) is a common intercept
- \(\mathbf{\omega} = (\omega_1, \dots, \omega_k)\) is a conditional autoregressive (CAR) model with precision matrix \(\tau_1\mathbf{Q}\)
- \(\mathbf{z} = (z_1, \dots, z_k)\) is an unstructured random effect with precision \(\tau_2\)
Stage 3 Hyperparameters
The hyperparameters of the model are \(\tau_1\) and \(\tau_2\)
NOTE In this case the linear predictor \(\eta\) consists of three components!!
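The model above can be fitted with bru(). The following is a sketch, not the prescribed solution: the component names (space, z) are choices made here, and the "besag" model is used because it builds the intrinsic CAR precision directly from the adjacency structure computed above.

```r
# sketch of the model fit (component names and model choices are assumptions)
cmp = ~ Intercept(1) +
  space(space, model = "besag", graph = R) +  # CAR effect omega, precision tau_1 * Q
  z(space, model = "iid")                     # unstructured effect z, precision tau_2

lik = bru_obs(observed ~ Intercept + space + z,
              data = resp_cases,
              family = "poisson",
              E = resp_cases$expected)  # expected counts enter as the Poisson exposure

fit = bru(cmp, lik)
summary(fit)
```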
After fitting the model we want to extract results.
We now look at the predictions over space.
Finally, we want to compare our observations \(y_i\) with the predicted means of the Poisson distribution \(E_i\lambda_i = E_i\exp(\eta_i)\).
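The pred object used below can be obtained with predict(); here is a sketch, assuming the components are named Intercept and space (matching the generate() call later in the document):

```r
# sketch: posterior summaries of the Poisson means E_i * lambda_i
# (component names are assumptions)
pred = predict(fit, resp_cases,
               ~ data.frame(cases = expected * exp(Intercept + space)))
```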
pred$cases %>% ggplot() + geom_point(aes(observed, mean)) +
geom_errorbar(aes(observed, ymin = q0.025, ymax = q0.975)) +
geom_abline(intercept = 0, slope = 1)
Note: Here we are predicting the mean of the counts, not the counts themselves! Predicting counts is the theme of the next task.
Getting prediction densities
Posterior predictive distributions, that is \(\pi(y_i^{\text{new}}|\mathbf{y})\) are of interest in many applied problems. The bru() function does not return predictive densities. In the previous step we have computed predictions for the expected counts \(\pi(E_i\lambda_i|\mathbf{y})\). The predictive distribution is then: \[
\pi(y_i^{\text{new}}|\mathbf{y}) = \int \pi(y_i|E_i\lambda_i)\pi(E_i\lambda_i|\mathbf{y})\ dE_i\lambda_i
\] where, in our case, \(\pi(y_i|E_i\lambda_i)\) is Poisson with mean \(E_i\lambda_i\). We can achieve this using the following algorithm:
- Simulate \(n\) replicates of \(g^k = E_i\lambda_i\) for \(k = 1,\dots,n\) using the function generate() which takes the same input as predict()
- For each of the \(k\) replicates simulate a new value \(y_i^{new}\) using the function rpois()
- Summarise the \(n\) samples of \(y_i^{new}\) using, for example, the mean and the 0.025 and 0.975 quantiles.
Here is the code:
# simulate 1000 realizations of E_i\lambda_i
expected_counts = generate(fit, resp_cases,
~ expected * exp(Intercept + space),
n.samples = 1000)
# simulate poisson data
aa = rpois(271*1000, lambda = as.vector(expected_counts))
sim_counts = matrix(aa, 271, 1000)
# summarise the samples with posterior means and quantiles
pred_counts = data.frame(observed = resp_cases$observed,
m = apply(sim_counts,1,mean),
q1 = apply(sim_counts,1,quantile, 0.025),
q2 = apply(sim_counts,1,quantile, 0.975),
vv = apply(sim_counts,1,var)
)
Geostatistics
In this practical we are going to fit a geostatistical model. We will:
Learn how to fit a geostatistical model in inlabru
Learn how to add spatial covariates to the model
Learn how to do predictions
Learn how to simulate from the fitted model
Libraries to load:
library(dplyr)
library(INLA)
library(inlabru)
library(sf)
library(terra)
# load some libraries to generate nice map plots
library(scico)
library(ggplot2)
library(patchwork)
library(mapview)
library(tidyterra)
The data
In this practical, we will explore data on the Pacific Cod (Gadus macrocephalus) from a trawl survey in Queen Charlotte Sound. The pcod dataset is available from the sdmTMB package and contains the presence/absence records of the Pacific Cod during each survey, along with the biomass density of Pacific Cod in the area swept (kg/km\(^2\)). The qcs_grid data contain the depth values stored as a \(2\times 2\) km grid for Queen Charlotte Sound.
The dataset contains presence/absence data from 2003 to 2017. In this practical we only consider year 2003.
We first load the dataset and select the year of interest
library(sdmTMB)
pcod_df = sdmTMB::pcod %>% filter(year==2003)
qcs_grid = sdmTMB::qcs_grid
Then, we create an sf object and assign the right coordinate reference system to it:
pcod_sf = st_as_sf(pcod_df, coords = c("lon","lat"), crs = 4326)
pcod_sf = st_transform(pcod_sf,
crs = "+proj=utm +zone=9 +datum=WGS84 +no_defs +type=crs +units=km" )
We convert the covariate into a raster and assign the same coordinate reference:
depth_r <- rast(qcs_grid, type = "xyz")
crs(depth_r) <- crs(pcod_sf)
Finally, we can plot our dataset. Note that to plot the raster we also need to load the tidyterra library.
ggplot()+
geom_spatraster(data=depth_r$depth)+
geom_sf(data=pcod_sf,aes(color=factor(present))) +
scale_color_manual(name="Occupancy status for the Pacific Cod",
values = c("black","orange"),
labels= c("Absence","Presence"))+
scale_fill_scico(name = "Depth",
palette = "nuuk",
na.value = "transparent" ) + xlab("") + ylab("")
The model
We first fit a simple model where we consider the observations as Bernoulli and where the linear predictor contains only an intercept and a Gaussian random field (GF) defined through the SPDE approach. The model is defined as:
Stage 1 Model for the response \[ y(s)|\eta(s)\sim\text{Binom}(1, p(s)) \] Stage 2 Latent field model \[ \eta(s) = \text{logit}(p(s)) = \beta_0 + \omega(s) \] with \[ \omega(s)\sim \text{ GF with range } \rho\ \text{ and marginal variance }\ \sigma^2 \]
Stage 3 Hyperparameters
The hyperparameters of the model are \(\rho\) and \(\sigma\)
NOTE In this case the linear predictor \(\eta\) consists of two components!!
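As a reminder, the logit link in Stage 2 maps probabilities in \((0,1)\) to the whole real line, so the linear predictor is unconstrained. In base R:

```r
# logit and inverse logit in base R
eta <- qlogis(0.5)   # logit(0.5) = 0: probability 0.5 maps to eta = 0
p   <- plogis(1.5)   # inverse logit: maps eta = 1.5 back to a probability
c(eta = eta, p = p)
```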
The workflow
When fitting a geostatistical model we need to complete the following tasks:
Build the mesh
Define the SPDE representation of the spatial GF. This includes defining the priors for the range and sd of the spatial GF
Define the components of the linear predictor. This includes the spatial GF and any covariates
Define the observation model using the bru_obs() function
Run the model using the bru() function
1. Building the mesh
The first task, when dealing with geostatistical models in inlabru, is to build the mesh that covers the area of interest. For this purpose we use the function fm_mesh_2d().
One way to build the mesh is to start from the locations where we have observations; these are contained in the dataset pcod_sf.
mesh = fm_mesh_2d(loc = pcod_sf, # Build the mesh
cutoff = 2,
max.edge = c(7,20), # The largest allowed triangle edge length.
offset = c(5,50)) # The automatic extension distance
ggplot() + gg(mesh) + geom_sf(data= pcod_sf, aes(color = factor(present))) + xlab("") + ylab("")
2. Define the SPDE representation of the spatial GF
To define the SPDE representation of the spatial GF we use the function inla.spde2.pcmatern. This takes as input the mesh we have defined and the PC-priors definition for \(\rho\) and \(\sigma\) (the range and the marginal standard deviation of the field).
PC priors for Gaussian random fields are defined in (Fuglstad et al. 2018). From a practical perspective, for the range \(\rho\) you need to define two parameters \(\rho_0\) and \(p_{\rho}\) such that you believe it is reasonable that \[ P(\rho<\rho_0)=p_{\rho} \] while for the marginal standard deviation \(\sigma\) you need to define two parameters \(\sigma_0\) and \(p_{\sigma}\) such that you believe it is reasonable that \[ P(\sigma>\sigma_0)=p_{\sigma} \]
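A quick check that this parameterisation behaves as intended: the PC prior for the range has the closed-form CDF \(P(\rho<r)=\exp(-\tilde\lambda/r)\) with \(\tilde\lambda=-\log(p_{\rho})\rho_0\), so evaluating the CDF at \(r=\rho_0\) must return exactly \(p_{\rho}\):

```r
# closed-form CDF of the PC prior for the range of a Matern field
pc_range_cdf <- function(r, rho_0, p_rho) {
  lambda_tilde <- -log(p_rho) * rho_0
  exp(-lambda_tilde / r)
}
pc_range_cdf(30, rho_0 = 30, p_rho = 0.5)   # → 0.5, as required
```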
You can use the following function to compute and plot the prior distributions for the range and sd of the Matern field.
dens_prior_range = function(rho_0, p_alpha)
{
# compute the density of the PC prior for the
# range rho of the Matern field
# rho_0 and p_alpha are defined such that
# P(rho<rho_0) = p_alpha
rho = seq(0, rho_0*10, length.out =100)
alpha1_tilde = -log(p_alpha) * rho_0
dens_rho = alpha1_tilde / rho^2 * exp(-alpha1_tilde / rho)
return(data.frame(x = rho, y = dens_rho))
}
dens_prior_sd = function(sigma_0, p_sigma)
{
# compute the density of the PC prior for the
# sd sigma of the Matern field
# sigma_0 and p_sigma are defined such that
# P(sigma>sigma_0) = p_sigma
sigma = seq(0, sigma_0*10, length.out =100)
alpha2_tilde = -log(p_sigma)/sigma_0
dens_sigma = alpha2_tilde* exp(-alpha2_tilde * sigma)
return(data.frame(x = sigma, y = dens_sigma))
}
Here are some alternatives for defining priors for our model:
spde_model1 = inla.spde2.pcmatern(mesh,
prior.sigma = c(.1, 0.5),
prior.range = c(30, 0.5))
spde_model2 = inla.spde2.pcmatern(mesh,
prior.sigma = c(10, 0.5),
prior.range = c(1000, 0.5))
spde_model3 = inla.spde2.pcmatern(mesh,
prior.sigma = c(1, 0.5),
prior.range = c(100, 0.5))
And here we plot the different priors for the range:
ggplot() + geom_line(data = dens_prior_range(30,.5), aes(x,y, color = "model1")) +
geom_line(data = dens_prior_range(1000,.5), aes(x,y, color = "model2")) +
geom_line(data = dens_prior_range(100,.5), aes(x,y, color = "model3"))
and for the sd:
ggplot() + geom_line(data = dens_prior_sd(.1,.5), aes(x,y, color = "model1")) +
geom_line(data = dens_prior_sd(10,.5), aes(x,y, color = "model2")) +
geom_line(data = dens_prior_sd(1,.5), aes(x,y, color = "model3"))
3. Define the components of the linear predictor
We have now defined a mesh and a SPDE representation of the spatial GF. We now need to define the model components:
cmp = ~ Intercept(1) + space(geometry, model = spde_model3)
NOTE: since the dataframe we use (pcod_sf) is an sf object, the input to the space() component is the geometry of the dataset.
4. Define the observation model
Our data are Bernoulli distributed, so we can define the observation model as:
formula = present ~ Intercept + space
lik = bru_obs(formula = formula,
data = pcod_sf,
family = "binomial")
5. Run the model
Finally we are ready to run the model
fit1 = bru(cmp,lik)
Extract results
Hyperparameters
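The posterior summaries of \(\rho\) and \(\sigma\) can be read directly off the fitted object; the sketch below assumes the default INLA naming for the hyperparameters of a component called space (the exact labels may differ, check summary(fit1)):

```r
# posterior summaries of the hyperparameters (range and sd of the Matern field)
fit1$summary.hyperpar
# plot the posterior marginals with inlabru's plot method
plot(fit1, "Range for space")
plot(fit1, "Stdev for space")
```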
Spatial prediction
We now want to extract the estimated posterior mean and sd of the spatial GF. To do this we first need to define a grid of points where we want to predict. We do this using the function fm_pixels(), which creates a regular grid of points covering the mesh.
pxl = fm_pixels(mesh)
Then we compute the prediction for both the spatial GF and the linear predictor (spatial GF + intercept):
preds = predict(fit1, pxl, ~data.frame(spatial = space,
total = Intercept + space))
Finally, we can plot the maps:
ggplot() + geom_sf(data = preds$spatial, aes(color = mean)) + scale_color_scico() + ggtitle("Posterior mean")
ggplot() + geom_sf(data = preds$spatial, aes(color = sd)) + scale_color_scico() + ggtitle("Posterior sd")
Note: The posterior sd is lowest at the observation points. Note how the posterior sd is inflated around the border; this is the “border effect” due to the SPDE representation.
Instead of predicting over a grid covering the whole mesh, we can limit our predictions to the points where the covariate is defined. We can do this by defining a sf object using coordinates in the object depth_r.
pxl1 = data.frame(crds(depth_r),
as.data.frame(depth_r$depth)) %>%
filter(!is.na(depth)) %>%
st_as_sf(coords = c("x","y"))
Instead of computing the posterior mean and standard deviation of the estimated surface, one can also simulate possible realizations of such a surface. This gives the user a better idea of the type of realized surfaces one can expect. We can do this using the function generate().
# we simulate 4 samples from the fitted model
gens = generate(fit1, pxl1, ~ (Intercept + space),
n.samples = 4)
pp = cbind(pxl1, gens)
pp %>% select(-depth) %>%
tidyr::pivot_longer(-geometry) %>%  # pivot_longer comes from the tidyr package
ggplot() +
geom_sf(aes(color = value)) +
facet_wrap(.~name) +
scale_color_scico(direction = -1) +
ggtitle("Sample from the fitted model")
An alternative model
We now want to check whether the depth covariate has an influence on the probability of presence. We do this using two different models.
- Model 1 The depth enters the model in a linear way. The linear predictor is then defined as:
\[ \eta(s) = \text{logit}(p(s)) = \beta_0 + \omega(s) + \beta_1\ \text{depth}(s) \]
- Model 2 The depth enters the model in a non-linear way. The linear predictor is then defined as:
\[ \eta(s) = \text{logit}(p(s)) = \beta_0 + \omega(s) + f(\text{depth}(s)) \] where \(f(.)\) is a smooth function. We will use a RW2 model for this.
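Model 1 (linear depth effect) can be fitted along the same lines as before. The sketch below assumes the depth column available in pcod_sf and reuses spde_model3; both are assumptions, not prescribed by the text:

```r
# sketch of Model 1: depth enters the linear predictor linearly
# (component and variable names are assumptions)
cmp1 = ~ Intercept(1) + space(geometry, model = spde_model3) +
  depth(depth, model = "linear")
lik1 = bru_obs(present ~ Intercept + space + depth,
               data = pcod_sf, family = "binomial")
fit2 = bru(cmp1, lik1)
summary(fit2)
```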
We now want to fit Model 2, where we allow the effect of depth to be non-linear. To use the RW2 model we need to group the values of depth into distinct classes. To do this we use the function inla.group() which, by default, creates 20 groups. Then we can fit the model as usual.
# create the grouped variable
depth_r$depth_group = inla.group(values(depth_r$depth_scaled))
# run the model
cmp = ~ Intercept(1) + space(geometry, model = spde_model3) +
covariate(depth_r$depth_group, model = "rw2")
formula = present ~ Intercept + space + covariate
lik = bru_obs(formula = formula,
data = pcod_sf,
family = "binomial")
fit3 = bru(cmp, lik)
# plot the estimated effect of depth
fit3$summary.random$covariate %>% ggplot() + geom_line(aes(ID,mean)) +
geom_ribbon(aes(ID, ymin = `0.025quant`,
ymax = `0.975quant`), alpha = 0.5)